19. Clean (Intro)

Clean: Intro

Improving Quality and Tidiness

Cleaning means acting on the assessments we made to improve quality and tidiness.

Improving Quality

Improving quality doesn't mean changing the data to make it say something different; that would be data fraud.

Consider the animals DataFrame, which has headers for name, body weight (in kilograms), and brain weight (in grams). The last five rows of this DataFrame are displayed below:

Examples of improving quality include:

  • Correcting when inaccurate, like correcting the mouse's body weight to 0.023 kg instead of 230 kg
  • Removing when irrelevant, like removing the row with "Apple" since an apple is a fruit and not an animal
  • Replacing when missing, like filling in the missing value for brain weight for Brachiosaurus
  • Combining, like concatenating the missing rows in the more_animals DataFrame displayed below
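
The four kinds of fixes listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the lesson's own code. The rows in animals and more_animals are placeholder values (only the mouse's 230 kg error and its 0.023 kg correction come from the text), the column names body_weight_kg and brain_weight_g are assumed, and the Brachiosaurus brain weight stands in for a value you would look up in a trusted source.

```python
import numpy as np
import pandas as pd

# Placeholder stand-in for the animals DataFrame (values are illustrative).
animals = pd.DataFrame({
    "name": ["Mouse", "Apple", "Brachiosaurus", "Cow"],
    "body_weight_kg": [230.0, 0.1, 87000.0, 465.0],
    "brain_weight_g": [0.4, np.nan, np.nan, 423.0],
})

# Placeholder stand-in for the more_animals DataFrame.
more_animals = pd.DataFrame({
    "name": ["Cat", "Goat"],
    "body_weight_kg": [3.3, 27.66],
    "brain_weight_g": [25.6, 115.0],
})

# Work on a copy so the original data stays available for comparison.
animals_clean = animals.copy()

# Correct: the mouse's body weight was recorded as 230 kg instead of 0.023 kg.
animals_clean.loc[animals_clean["name"] == "Mouse", "body_weight_kg"] = 0.023

# Remove: an apple is a fruit, not an animal.
animals_clean = animals_clean[animals_clean["name"] != "Apple"].copy()

# Replace: fill the missing brain weight with a looked-up value
# (154.5 is a placeholder here, not a figure from the lesson).
is_brachiosaurus = animals_clean["name"] == "Brachiosaurus"
animals_clean.loc[is_brachiosaurus, "brain_weight_g"] = 154.5

# Combine: append the rows that live in a separate DataFrame.
animals_clean = pd.concat([animals_clean, more_animals], ignore_index=True)

print(animals_clean)
```

Each step changes the data only to make it more accurate or complete; none of them changes what the data says.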

Improving Tidiness

Improving tidiness means transforming the dataset so that each variable is a column, each observation is a row, and each type of observational unit is a table. There are special functions in pandas that help us do that. We'll dive deeper into those in Lesson four of this course.
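
As a small preview, here is one way a tidiness fix might look. It is only a sketch under assumptions: the untidy table and its column names are made up for illustration, and pd.melt is just one of the pandas functions that can reshape data this way, not necessarily the one the later lesson uses. In the hypothetical table, the variable "year" is hidden in the column headers, so each row holds two observations.

```python
import pandas as pd

# Hypothetical untidy table: the "year" variable is spread across column headers.
untidy = pd.DataFrame({
    "name": ["Cow", "Goat"],
    "2019": [465.0, 27.7],
    "2020": [470.0, 28.1],
})

# Reshape so that each variable is a column and each observation is a row.
tidy = pd.melt(
    untidy,
    id_vars=["name"],             # columns that already hold one variable each
    value_vars=["2019", "2020"],  # columns whose headers are really values
    var_name="year",
    value_name="body_weight_kg",
)

# The former headers come out as strings; store the year as an integer.
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["name", "year"]).reset_index(drop=True)

print(tidy)
```

After reshaping, the table has one row per animal per year, with name, year, and body_weight_kg each in its own column.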

Programmatic Data Cleaning Process

The programmatic data cleaning process:

  1. Define
  2. Code
  3. Test

Defining means writing down a data cleaning plan, in which we turn our assessments into specific cleaning tasks. This plan also serves as an instruction list, so that others (or we ourselves in the future) can look at our work and reproduce it.

Coding means translating these definitions to code and executing that code.

Testing means checking our dataset, often using code, to make sure our cleaning operations worked.
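
The three steps above can be sketched as a single small workflow. This is an illustrative example only: the animals data and the body_weight_kg column name are assumptions, and the Define step is written as comments here, although in a notebook it could just as well be a text cell.

```python
import pandas as pd

# Hypothetical data with two quality issues found during assessment.
animals = pd.DataFrame({
    "name": ["Mouse", "Apple", "Cow"],
    "body_weight_kg": [230.0, 0.1, 465.0],
})

# --- Define ---
# 1. Set the mouse's body weight to 0.023 kg (it was recorded as 230 kg).
# 2. Drop the "Apple" row, since an apple is not an animal.

# --- Code ---
# Clean a copy so the original stays untouched and the steps can be re-run.
animals_clean = animals.copy()
animals_clean.loc[animals_clean["name"] == "Mouse", "body_weight_kg"] = 0.023
animals_clean = animals_clean[animals_clean["name"] != "Apple"].reset_index(drop=True)

# --- Test ---
# Each assertion checks one task from the Define step.
mouse_weight = animals_clean.loc[
    animals_clean["name"] == "Mouse", "body_weight_kg"
].iloc[0]
assert mouse_weight == 0.023
assert "Apple" not in animals_clean["name"].tolist()
```

If an assertion fails, the cleaning code did not do what the plan describes, and the Code step needs another look before moving on.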